The Inverse Regression Topic Model

Authors

Maxim Rabinovich

David M. Blei

Abstract

Taddy (2013) proposed multinomial inverse regression (MNIR) as a new model of annotated text based on the influence of metadata and response variables on the distribution of words in a document. While effective, MNIR has no way to exploit structure in the corpus to improve its predictions or facilitate exploratory data analysis. On the other hand, traditional probabilistic topic models (like latent Dirichlet allocation) capture natural heterogeneity in a collection but do not account for external variables. In this paper, we introduce the inverse regression topic model (IRTM), a mixed-membership extension of MNIR that combines the strengths of both methodologies. We present two inference algorithms for the IRTM: an efficient batch estimation algorithm and an online variant, which is suitable for large corpora. We apply these methods to a corpus of 73K Congressional press releases and another of 150K Yelp reviews, demonstrating that the IRTM outperforms both MNIR and supervised topic models on the prediction task. Further, we give examples showing that the IRTM enables systematic discovery of in-topic lexical variation, which is not possible with previous supervised topic models.

Similar resources

Regression Analysis under Inverse Gaussian Model: Repeated Observation Case

Traditional regression analyses assume normality of observations and independence of mean and variance. However, there are many examples in science and technology where the observations come from a skewed distribution and, moreover, the variance depends functionally on the mean. In this article, we propose a method for regression analysis under the Inverse Gaussian model when th...

Modeling of the Relationships Between Spatio-Temporal Changes of Traffic Volume and Particulate Matter-2.5 Pollutant Concentration Based on Geographically Weighted Regression (GWR) and Inverse Distance Weighting (IDW) Model: A Case Study in Tehran M

Background and Aim: High concentrations of particulate matter-2.5 (PM2.5) have been the cause of the unhealthiest days in Tehran, Iran in recent years. This study was conducted with the aim of spatio-temporal analysis of traffic volume and its relationship with PM2.5 pollutant concentrations in the Tehran metropolis during 2015-2018, using the Geographic Information System (GIS). Materi...

Some Modifications to Calculate Regression Coefficients in Multiple Linear Regression

In a multiple linear regression model, there are instances where one has to update the regression parameters. In such models, as new data become available and a row is added to the design matrix, the least-squares estimates for the parameters must be updated to reflect the impact of the new data. We will modify two existing methods of calculating regression coefficients in multiple linear regres...

A Joint Semantic Vector Representation Model for Text Clustering and Classification

Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researchers to use...
